Parallel clustering of high-dimensional social media data streams

Abstract

We introduce Cloud DIKW as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. Recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. Due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts to meet the constraints of real-time social stream clustering through parallelization. We focus on two system-level issues. Most stream processing engines like Apache Storm organize distributed workers in the form of a directed acyclic graph, making it difficult to dynamically synchronize the state of parallel workers. We tackle this challenge by creating a separate synchronization channel using a pub-sub messaging system. Due to the sparsity of the high-dimensional vectors, the size of centroids grows quickly as new data points are assigned to the clusters. Traditional synchronization that directly broadcasts cluster centroids becomes too expensive and limits the scalability of the parallel algorithm. We address this problem by communicating only dynamic changes of the clusters rather than the whole centroid vectors. Our algorithm under Cloud DIKW can process the Twitter 10% data stream in real time with 96-way parallelism. By natural improvements to Cloud DIKW, including advanced collective communication techniques developed in our Harp project, we will be able to process the full Twitter stream in real time with 1000-way parallelism. Our use of powerful general software subsystems will enable many other applications that need integration of streaming and batch data analytics.
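
The idea of synchronizing only the changes to clusters, rather than whole centroid vectors, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `ClusterState`, `Worker`, and `sync` are hypothetical, sparse vectors are modeled as plain dictionaries, and an in-process loop stands in for the pub-sub synchronization channel. Each worker accumulates only the terms touched since the last synchronization and exchanges that delta, so message size tracks recent activity rather than the full (growing) centroid.

```python
from collections import defaultdict


class ClusterState:
    """Local replica of every cluster: sparse sum vector plus member count."""

    def __init__(self):
        self.sums = defaultdict(lambda: defaultdict(float))
        self.counts = defaultdict(int)

    def apply_delta(self, delta):
        # delta maps cluster_id -> (sparse vector of added weights, points added)
        for cid, (vec, n) in delta.items():
            for term, weight in vec.items():
                self.sums[cid][term] += weight
            self.counts[cid] += n

    def centroid(self, cid):
        n = self.counts[cid]
        if n == 0:
            return {}
        return {term: weight / n for term, weight in self.sums[cid].items()}


class Worker:
    """Hypothetical parallel worker that records changes instead of centroids."""

    def __init__(self):
        self.state = ClusterState()
        self.pending = {}  # changes accumulated since the last sync

    def assign(self, cid, point):
        # Record the point as a pending change to cluster `cid`.
        vec, n = self.pending.get(cid, ({}, 0))
        for term, weight in point.items():
            vec[term] = vec.get(term, 0.0) + weight
        self.pending[cid] = (vec, n + 1)

    def flush(self):
        # Hand over the pending delta and start a fresh one.
        delta, self.pending = self.pending, {}
        return delta


def sync(workers):
    """Stand-in for the pub-sub channel: every worker receives every delta."""
    deltas = [w.flush() for w in workers]
    for w in workers:
        for delta in deltas:
            w.state.apply_delta(delta)
```

In this simplified sketch a worker's own replica is updated only at synchronization time, so all replicas stay identical after each call to `sync`. A real deployment would also need message ordering and failure handling, which are omitted here.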